The goal is to build a machine learning algorithm that uses the features as input and predicts an outcome variable (here, air pollution levels) in situations where we do not know the outcome variable. The way we do this is to use data where we have both the input and output data to train a machine learning algorithm. To train a machine learning algorithm, we will use the tidymodels package ecosystem.
Pre-processing the data
After splitting the data, the next step is to process the training and testing data so that the data are compatible and optimized to be used with the model. This involves assigning variables to specific roles within the model and pre-processing steps like scaling variables and removing redundant variables. This process can also be called feature engineering.
To do this in tidymodels, we will create what’s called a “recipe” using the recipes package, which is a standardized format for a sequence of steps for processing the data. This can be very useful because it makes testing out different pre-processing steps or different algorithms with the same pre-processing very easy and reproducible. Creating a recipe specifies how a data frame of predictors should be created: it specifies which variables to use and the pre-processing steps, but it does not execute these steps or create the data frame of predictors.
Step 1: Specify the variables with the recipe() function
The first thing to do to create a recipe is to specify which variables we will be using as our outcome and predictors using the recipe() function. In terms of the metaphor of baking, we can think of this as listing our ingredients. Translating this to the recipes package, we use the recipe() function to assign roles to all the variables.
Let’s try the simplest recipe with no pre-processing steps: simply list the outcome and predictor variables.
We can do so in two ways:
- Using formula notation
- Assigning roles to each variable
Let’s look at the first way using formula notation, which looks like this:
outcome(s) ~ predictor(s)
In the case of multiple predictors, or a multivariate situation with two outcomes, use a plus sign:
outcome1 + outcome2 ~ predictor1 + predictor2
If we want to include all predictors we can use a period like so:
outcome_variable_name ~ .
Now with our data, we will start by making a recipe for our training data. If you recall, the continuous outcome variable is value (the average annual gravimetric monitor PM2.5 concentration in ug/m3). Our features (or predictor variables) are all the other variables except the monitor ID, which is an id variable.
The reason not to include the id variable is that it combines the county number with a number designating which particular monitor (among those in that county) the values came from. Since this number is arbitrary, the county information is already given in the data, and each monitor has only one value in the value variable, nothing is gained by including this variable and it may instead introduce noise. However, it is useful to keep this variable around to take a look at what is happening later. We will show you what to do in this case in just a bit.
In the simplest case, we might use all predictors like this:
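A minimal sketch of such a recipe, assuming the training split is stored in a data frame named train_pm, might look like this:

```r
library(tidymodels)

# Simplest possible recipe: value is the outcome, everything else a predictor
simple_rec <- train_pm %>%
  recipe(value ~ .)

simple_rec
```

Printing the recipe object produces the summary shown below.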
Data Recipe
Inputs:
role #variables
outcome 1
predictor 49
We see a recipe has been created with 1 outcome variable and 49 predictor variables (or features). Also, notice how we named the output of recipe(). The naming convention for recipe objects is *_rec or rec.
Now, let’s get back to the id variable. Instead of including it as a predictor variable, we could also use the update_role() function of the recipes package.
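A sketch of this approach, again assuming the training data is in train_pm:

```r
# Give the monitor id a dedicated "id variable" role
# instead of leaving it as a predictor
simple_rec <- train_pm %>%
  recipe(value ~ .) %>%
  update_role(id, new_role = "id variable")

simple_rec
```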
Data Recipe
Inputs:
role #variables
id variable 1
outcome 1
predictor 48
Click here to learn more about working with id variables
This option works well with the newer workflows package. However, id variables are often dropped from analyses that do not use this newer package, as they can make the process difficult when using the parsnip package alone, because new levels (or possible values) may be introduced with the testing data.
We could also specify the outcome and predictors in the same way as the id variable. Please see here for examples of other roles for variables. The role can actually be any value.
The order is important here, as we first make all variables predictors and then override this role for the outcome and id variable. We will use the everything() function of the dplyr package to start with all of the variables in train_pm.
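A sketch of the role-assignment approach (names assumed to match the earlier recipe):

```r
# Start with no roles, then assign them explicitly;
# order matters: everything() first, then override value and id
simple_rec <- train_pm %>%
  recipe() %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  update_role(id, new_role = "id variable")
```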
Data Recipe
Inputs:
role #variables
id variable 1
outcome 1
predictor 48
If we want to take a look at the formula from our recipe, we can use the formula() function of the stats package.
value ~ fips + lat + lon + state + county + city + CMAQ + zcta +
zcta_area + zcta_pop + imp_a500 + imp_a1000 + imp_a5000 +
imp_a10000 + imp_a15000 + county_area + county_pop + log_dist_to_prisec +
log_pri_length_5000 + log_pri_length_10000 + log_pri_length_15000 +
log_pri_length_25000 + log_prisec_length_500 + log_prisec_length_1000 +
log_prisec_length_5000 + log_prisec_length_10000 + log_prisec_length_15000 +
log_prisec_length_25000 + log_nei_2008_pm25_sum_10000 + log_nei_2008_pm25_sum_15000 +
log_nei_2008_pm25_sum_25000 + log_nei_2008_pm10_sum_10000 +
log_nei_2008_pm10_sum_15000 + log_nei_2008_pm10_sum_25000 +
popdens_county + popdens_zcta + nohs + somehs + hs + somecollege +
associate + bachelor + grad + pov + hs_orless + urc2013 +
urc2006 + aod
<environment: 0x7ffeb8aa5fd8>
We can also view our recipe in more detail using the base summary() function.
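For example (assuming the recipe object is named simple_rec as above):

```r
# One row per variable, with its type, role, and source
summary(simple_rec)
```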
# A tibble: 50 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 id nominal id variable original
2 value numeric outcome original
3 fips nominal predictor original
4 lat numeric predictor original
5 lon numeric predictor original
6 state nominal predictor original
7 county nominal predictor original
8 city nominal predictor original
9 CMAQ numeric predictor original
10 zcta nominal predictor original
# … with 40 more rows
To summarize this step, we use the recipe() function to assign roles to all the variables:

Step 2: Specify the pre-processing steps with step_*() functions
Next, we use the step_*() functions from the recipes package to specify pre-processing steps.

This link and this link show the many options for recipe step functions.
There are step functions for a variety of purposes:
- Imputation – filling in missing values based on the existing data
- Transformation – changing all values of a variable in the same way, typically to make it more normal or easier to interpret
- Discretization – converting continuous values into discrete or nominal values - binning for example to reduce the number of possible levels (However this is generally not advisable!)
- Encoding / Creating Dummy Variables – creating a numeric code for categorical variables (More on Dummy Variables and one hot encoding)
- Data type conversions – which means changing from integer to factor or numeric to date etc.
- Interaction term addition to the model – adding terms for predictors that influence each other’s capacity to predict the outcome
- Normalization – centering and scaling the data to a similar range of values
- Dimensionality Reduction/ Signal Extraction – reducing the space of features or predictors to a smaller set of variables that capture the variation or signal in the original variables (ex. Principal Component Analysis and Independent Component Analysis)
- Filtering – filtering options for removing variables (ex. remove variables that are highly correlated to others or remove variables with very little variance and therefore likely little predictive capacity)
- Row operations – performing functions on the values within the rows (ex. rearranging, filtering, imputing)
- Checking functions – Sanity checks to look for missing values, to look at the variable classes etc.
All of the step functions look like step_*() with the * replaced with a name, except for the check functions which look like check_*().
There are several ways to select what variables to apply steps to:
- Using tidyselect methods: contains(), matches(), starts_with(), ends_with(), everything(), num_range()
- Using the type: all_nominal(), all_numeric(), has_type()
- Using the role: all_predictors(), all_outcomes(), has_role()
- Using the name - use the actual name of the variable/variables of interest
Let’s try adding some steps to our recipe.
We might consider log transforming our population and area variables (that aren’t densities) - let’s take a look at the range of these variables.
We can see that the range for each of these variables is quite large. We can log transform this data using the step_log() function of the recipes package.
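A sketch of this step; the choice of population and area variables here is an assumption based on the data shown later:

```r
# Log transform the large-range (non-density) population and area variables
simple_rec %>%
  step_log(zcta_area, zcta_pop, county_area, county_pop)
```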
We might also want to one-hot encode some of our categorical variables so that they can be used with certain algorithms. We can do this with the step_dummy() function and the one_hot = TRUE argument. One-hot encoding means that we do not simply encode our categorical variables numerically, since numeric assignments can be interpreted by algorithms as having a particular rank or order. Instead, each level gets its own binary variable of 1s and 0s, so no apparent order is implied.
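A sketch of this step, using the categorical variables named in the output below:

```r
# One-hot encode the categorical predictors
simple_rec %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE)
```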
Data Recipe
Inputs:
role #variables
id variable 1
outcome 1
predictor 48
Operations:
Dummy variables from state, county, city, zcta
Our fips variable includes a numeric code for state and county - and therefore is essentially a proxy for county. Since we already have county, we will just use it and keep the fips ID as another ID variable.
We can remove the fips variable from the predictors using update_role() to make sure that the role is no longer "predictor". We can make the role anything we want actually, so we will keep it something identifiable.
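A sketch of this role update:

```r
# fips is a proxy for county, so give it its own identifiable role
# rather than leaving it as a predictor
simple_rec %>%
  update_role(fips, new_role = "county id")
```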
Data Recipe
Inputs:
role #variables
county id 1
id variable 1
outcome 1
predictor 47
We might also want to remove variables that appear to be redundant and are highly correlated with others, as we know from our exploratory data analysis that many of our variables are correlated with one another. We can do this using the step_corr() function.
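A sketch of this step; note how CMAQ and aod are excluded from the filter with minus signs:

```r
# Remove predictors that are highly correlated with others,
# but always keep CMAQ and aod
simple_rec %>%
  step_corr(all_predictors(), -CMAQ, -aod)
```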
Data Recipe
Inputs:
role #variables
id variable 1
outcome 1
predictor 48
Operations:
Correlation filter on all_predictors, -, CMAQ, -, aod
Notice that we don’t want to remove some of our variables, like the CMAQ and aod variables, so we exclude them from the filter.
It is also a good idea to remove variables with near-zero variance, which can be done with the step_nzv() function. Variables have low variance if all the values are very similar, the values are very sparse, or if they are highly imbalanced.
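A sketch of this step, again keeping CMAQ and aod:

```r
# Drop predictors with near-zero variance (sparse or highly imbalanced)
simple_rec %>%
  step_nzv(all_predictors(), -CMAQ, -aod)
```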
Data Recipe
Inputs:
role #variables
id variable 1
outcome 1
predictor 48
Operations:
Sparse, unbalanced variable filter on all_predictors, -, CMAQ, -, aod
Click here to learn about examples where you might have near-zero variance variables
- Similar Values - If the population density was nearly the same for every zcta that contained a monitor, then knowing the population density near our monitor would contribute little to our model in assisting us to predict monitor air pollution values.
- Sparse Data - If all of the monitors were in locations where the populations did not attend graduate school, then these values would mostly be zero, which again would do very little to help us distinguish our air pollution monitors. When many of the values are zero, this is also called sparse data.
- Imbalanced Data - If nearly all of the monitors were located in one particular state, and all the others only had one monitor each, then the real predictive value would simply be in knowing if a monitor is located in that particular state or not. In this case we don’t want to remove our variable, we just want to simplify it.
See this blog post about why removing near-zero variance variables isn’t always a good idea if we think that a variable might be especially informative.
Let’s put all this together now.
Remember: it is important to add the steps to the recipe in an order that makes sense just like with a cooking recipe.
First, we are going to create numeric values for our categorical variables, then we will look at correlation and near-zero variance. We do not want to remove the CMAQ and aod variables, so we can make sure they are kept in the model by excluding them from those steps. If we specifically wanted to remove a predictor we could use step_rm().
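Putting the steps together in that order, the full recipe might look like this (a sketch, assuming the names used earlier):

```r
simple_rec <- train_pm %>%
  recipe(value ~ .) %>%
  update_role(id, new_role = "id variable") %>%
  update_role(fips, new_role = "county id") %>%
  # dummy variables first, so the filters see numeric predictors
  step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
  # then remove correlated and near-zero-variance predictors,
  # always keeping CMAQ and aod
  step_corr(all_predictors(), -CMAQ, -aod) %>%
  step_nzv(all_predictors(), -CMAQ, -aod)

simple_rec
```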
Data Recipe
Inputs:
role #variables
county id 1
id variable 1
outcome 1
predictor 47
Operations:
Dummy variables from state, county, city, zcta
Correlation filter on all_predictors, -, CMAQ, -, aod
Sparse, unbalanced variable filter on all_predictors, -, CMAQ, -, aod
Running the pre-processing
Step 1: Update the recipe with training data using prep()
The next major function of the recipes package is prep(). This function updates the recipe object based on the training data: it estimates the quantities and statistics required by the steps for each variable, and it updates the recipe’s list of variables, since some of the predictors may be removed. This makes the recipe ready to use on other data sets. It does not necessarily execute the pre-processing itself; however, we will specify an argument for it to do this so that we can take a look at the pre-processed data.
There are some important arguments to know about:
- training - you must supply a training data set to estimate parameters for pre-processing operations (recipe steps) - this may already be included in your recipe, as is the case for us
- fresh - if fresh = TRUE, prep() will retrain and estimate parameters for any previous steps that were already prepped when you add more steps to the recipe
- verbose - if verbose = TRUE, shows the progress as the steps are evaluated and the size of the pre-processed training set
- retain - if retain = TRUE, the pre-processed training set will be saved within the recipe (as template). This is good if you are likely to add more steps and do not want to rerun prep() on the previous steps. However, this can make the recipe size large. This is necessary if you want to actually look at the pre-processed data.
Let’s try out the prep() function:
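A sketch of this call, assuming the recipe object is named simple_rec:

```r
# Estimate the step parameters from the training data;
# retain = TRUE keeps the pre-processed training set inside the recipe
prepped_rec <- prep(simple_rec, verbose = TRUE, retain = TRUE)

# Components stored in the prepped recipe
names(prepped_rec)
```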
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
[1] "var_info" "term_info" "steps" "template"
[5] "levels" "retained" "tr_info" "orig_lvls"
[9] "last_term_info"
There are also lots of useful things to check out in the output of prep(). You can see:
- the steps that were run
- the variable info (var_info)
- the model term info (term_info)
- the new levels of the variables
- the original levels of the variables (orig_lvls)
- info about the training data set size and completeness (tr_info)
Note: You may see the prep.recipe() function in material that you read about the recipes package. This is referring to the prep() function of the recipes package.
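To look at the pre-processed training data retained by prep(), we can use juice() (a sketch, assuming the prepped recipe is named prepped_rec):

```r
# Extract the pre-processed training set stored in the prepped recipe
preproc_train <- juice(prepped_rec)

# dplyr::glimpse() prints one row per column, as shown below
glimpse(preproc_train)
```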
Rows: 584
Columns: 36
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000 <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013 <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
For easy comparison’s sake, here is our original data:
Rows: 876
Columns: 50
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003…
$ value <dbl> 9.597647, 10.800000, 11.212174, 11.659091…
$ fips <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9683…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama…
$ county <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "…
$ city <chr> "Fairhope", "Ashland", "Muscle Shoals", "…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9…
$ zcta <fct> 36532, 36251, 35660, 35962, 35901, 36303,…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 203836235…
$ zcta_pop <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 90…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 5.78…
$ imp_a1000 <dbl> 1.4096021, 0.8531574, 11.1448962, 3.86764…
$ imp_a5000 <dbl> 3.3360118, 0.9851479, 15.1786154, 1.23114…
$ imp_a10000 <dbl> 1.9879187, 0.5208189, 9.7253870, 1.031646…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 0.973044…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 20126…
$ county_pop <dbl> 182265, 13932, 54428, 71109, 104430, 1015…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9…
$ log_pri_length_10000 <dbl> 9.210340, 9.210340, 9.274303, 10.409411, …
$ log_pri_length_15000 <dbl> 9.630228, 9.615805, 9.658899, 11.173626, …
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 10.21420…
$ log_prisec_length_10000 <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 1…
$ log_prisec_length_15000 <dbl> 12.205723, 12.042963, 13.282656, 12.35366…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 1…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.350…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 35.330814…
$ popdens_zcta <dbl> 145.716431, 13.639555, 540.887040, 40.718…
$ nohs <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7,…
$ somehs <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, …
$ hs <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5,…
$ associate <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17…
$ grad <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, …
$ pov <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4,…
$ urc2013 <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ urc2006 <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ aod <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 4…
Notice how we only have 36 variables now instead of 50! Two of these are our ID variables (fips and the actual monitor ID (id)) and one is our outcome (value). Thus we only have 33 predictors now. We can also see that we no longer have any categorical variables: variables like state are gone, and only state_California remains, as it was the only state indicator with nonzero variance. We can also see that there were more monitors listed as "Not in a city" than in any particular city.
We can see that California had the largest number of monitors compared to the other states.
Scroll through the output:
Note: Recall that you must specify retain = TRUE argument of the prep() function to use juice().
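Applying the trained pre-processing to the test data is done with bake() (a sketch; test_pm is assumed to hold the testing split):

```r
# Apply the same trained pre-processing steps to the test data
baked_test_pm <- bake(prepped_rec, new_data = test_pm)

glimpse(baked_test_pm)
```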
Rows: 292
Columns: 36
$ id <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500 <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000 <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000 <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000 <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500 <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000 <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000 <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000 <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013 <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
$ city_Not.in.a.city <dbl> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA…
Notice that our city_Not.in.a.city variable seems to be NA values. Why might that be?
Ah! Perhaps it is because some of our levels were not previously seen in the training set!
Let’s take a look using the set operations of the dplyr package to see which cities differ between the test and training sets.
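A sketch of this comparison (the split data frames train_pm and test_pm are assumptions):

```r
# Unique city values in each split
traincities <- train_pm %>% distinct(city)
testcities  <- test_pm %>% distinct(city)

# cities in the training set that are not in the test set
dim(dplyr::setdiff(traincities, testcities))

# cities in the test set that are not in the training set
dim(dplyr::setdiff(testcities, traincities))
```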
[1] 376 1
[1] 51 1
Indeed, there are lots of different cities in our test data that are not in our training data!
One option would be to update our original recipe to include a very important step function called step_novel(), which helps in cases like this where there are new factor levels in our testing set that were not in our training set. It is a good idea to include this step in most of your recipes where you have categorical variables with many distinct values, and it needs to come before we create dummy variables. However, since we are also creating a dummy variable from this variable, this still results in a problem.
Next, let’s go back to our pm data set and modify the city variable to have the values "In a city" or "Not in a city" using the case_when() function of dplyr. This function allows you to vectorize multiple if_else() statements.
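A sketch of this recoding, assuming the full data set is named pm:

```r
# Collapse city into two levels: "In a city" vs "Not in a city"
pm <- pm %>%
  mutate(city = case_when(
    city == "Not in a city" ~ "Not in a city",
    city != "Not in a city" ~ "In a city"
  ))
```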
# A tibble: 876 x 50
id value fips lat lon state county city CMAQ zcta zcta_area
<fct> <dbl> <fct> <dbl> <dbl> <chr> <chr> <chr> <dbl> <fct> <dbl>
1 1003… 9.60 1003 30.5 -87.9 Alab… Baldw… In a… 8.10 36532 190980522
2 1027… 10.8 1027 33.3 -85.8 Alab… Clay In a… 9.77 36251 374132430
3 1033… 11.2 1033 34.8 -87.7 Alab… Colbe… In a… 9.40 35660 16716984
4 1049… 11.7 1049 34.3 -86.0 Alab… DeKalb In a… 8.53 35962 203836235
5 1055… 12.4 1055 34.0 -86.0 Alab… Etowah In a… 9.24 35901 154069359
6 1069… 10.5 1069 31.2 -85.4 Alab… Houst… In a… 9.12 36303 162685124
7 1073… 15.6 1073 33.6 -86.8 Alab… Jeffe… In a… 10.2 35207 26929603
8 1073… 12.4 1073 33.3 -87.0 Alab… Jeffe… Not … 10.2 35111 166239542
9 1073… 11.1 1073 33.5 -87.3 Alab… Jeffe… Not … 8.16 35444 385566685
10 1073… 13.1 1073 33.5 -86.5 Alab… Jeffe… In a… 9.30 35094 148994881
# … with 866 more rows, and 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>,
# imp_a1000 <dbl>, imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>,
# county_area <dbl>, county_pop <dbl>, log_dist_to_prisec <dbl>,
# log_pri_length_5000 <dbl>, log_pri_length_10000 <dbl>,
# log_pri_length_15000 <dbl>, log_pri_length_25000 <dbl>,
# log_prisec_length_500 <dbl>, log_prisec_length_1000 <dbl>,
# log_prisec_length_5000 <dbl>, log_prisec_length_10000 <dbl>,
# log_prisec_length_15000 <dbl>, log_prisec_length_25000 <dbl>,
# log_nei_2008_pm25_sum_10000 <dbl>, log_nei_2008_pm25_sum_15000 <dbl>,
# log_nei_2008_pm25_sum_25000 <dbl>, log_nei_2008_pm10_sum_10000 <dbl>,
# log_nei_2008_pm10_sum_15000 <dbl>, log_nei_2008_pm10_sum_25000 <dbl>,
# popdens_county <dbl>, popdens_zcta <dbl>, nohs <dbl>, somehs <dbl>,
# hs <dbl>, somecollege <dbl>, associate <dbl>, bachelor <dbl>, grad <dbl>,
# pov <dbl>, hs_orless <dbl>, urc2013 <dbl>, urc2006 <dbl>, aod <dbl>
Alternatively you could create a custom step function to do this and add the step function to your recipe, but that is beyond the scope of this case study.
We will need to repeat all the steps (splitting the data, pre-processing, etc) as the levels of our variables have now changed.
While we are doing this, we might also have this issue for state and county. So let’s also do a similar thing for state. The county variable appears to get dropped due to either correlation or near-zero variance; it is likely near-zero variance, because county is the more granular of these geographic categorical variables and therefore likely sparse.
<Analysis/Assess/Total>
<584/292/876>
Now let’s re-prep the recipe on our training data and try baking our test data:
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
Rows: 584
Columns: 36
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000 <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013 <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
Notice, it looks like we gained the log_prisec_length_25000 back with this recipe using the data with our changes to state and city.
Rows: 292
Columns: 36
$ id <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500 <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000 <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000 <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000 <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500 <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000 <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000 <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000 <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013 <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Great, now we no longer have NA values! :)
Note: if you use the skip option for some of the pre-processing steps, be careful: juice() will show all of the results, ignoring skip = TRUE, while bake() will not necessarily conduct these steps on the new data.